feat: retry failed transaction commit by linguoxuan · Pull Request #576 · apache/iceberg-cpp

linguoxuan · 2026-02-26T10:53:48Z

This commit implements retry for transaction commits. It introduces a generic RetryRunner utility with exponential backoff and error-kind filtering, and integrates it into Transaction::Commit() so that a commit conflict automatically refreshes the table metadata and retries the commit.
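For readers unfamiliar with the pattern, here is a minimal sketch of a retry loop with capped exponential backoff and error-kind filtering. It is not the RetryRunner from this PR: `RetryConfig`, `BackoffMs`, `RunWithRetry`, and the two `ErrorKind` values are hypothetical names chosen for illustration, and the task reports failure through `std::optional<ErrorKind>` rather than the project's `Result<T>`.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <optional>
#include <thread>

// Hypothetical stand-ins for illustration; not the iceberg-cpp types.
enum class ErrorKind { kCommitFailed, kValidationFailed };

struct RetryConfig {
  int32_t num_retries = 4;  // retries after the first attempt
  int32_t min_wait_ms = 1;  // delay before the first retry
  int32_t max_wait_ms = 8;  // upper bound on any single delay
};

// Delay before retry number `retry` (1-based): min_wait_ms * 2^(retry - 1),
// capped at max_wait_ms.
inline int32_t BackoffMs(const RetryConfig& cfg, int32_t retry) {
  int64_t delay = cfg.min_wait_ms;
  for (int32_t i = 1; i < retry && delay < cfg.max_wait_ms; ++i) delay *= 2;
  return static_cast<int32_t>(std::min<int64_t>(delay, cfg.max_wait_ms));
}

// Runs `task` until it succeeds (returns nullopt), fails with an error other
// than kCommitFailed, or the attempts are exhausted. Returns the last error,
// or nullopt on success.
inline std::optional<ErrorKind> RunWithRetry(
    const RetryConfig& cfg,
    const std::function<std::optional<ErrorKind>()>& task,
    int32_t* attempts = nullptr) {
  const int32_t max_attempts = cfg.num_retries + 1;
  int32_t attempt = 0;
  while (true) {
    ++attempt;
    if (attempts != nullptr) *attempts = attempt;
    std::optional<ErrorKind> error = task();
    if (!error.has_value()) return std::nullopt;
    // Only commit conflicts are treated as retryable in this sketch.
    if (*error != ErrorKind::kCommitFailed || attempt >= max_attempts) {
      return error;
    }
    std::this_thread::sleep_for(
        std::chrono::milliseconds(BackoffMs(cfg, attempt)));
  }
}
```

With the defaults above, the delays before successive retries are 1, 2, 4, and 8 ms. A production version would likely also enforce a total time budget, which this sketch omits.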

wgtmac

I've carefully reviewed the retry mechanism and found a few parity issues and a structural data-loss concern regarding how pending updates are held during retries. Please see the inline comments.

src/iceberg/util/retry_util.h

src/iceberg/update/update_snapshot_reference.h

wgtmac · 2026-03-13T14:48:52Z

I just recalled a design flaw in the interaction between PendingUpdate and Transaction and created a fix: #591. Without this fix, users have to cache every pending update instance they create; otherwise the updates cannot be retried, because the transaction instance holds them only as weak_ptr.
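The lifetime problem described here can be reproduced in isolation. The sketch below uses a hypothetical `Update` struct and registry, not the real PendingUpdate or Transaction classes, and only assumes what the comment states: the owner keeps `std::weak_ptr` entries, so an update survives for a retry only if the caller still holds a `std::shared_ptr` to it.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a pending update.
struct Update {
  int id;
};

// Counts the registered updates that are still alive and could be re-applied.
inline std::size_t CountReplayable(
    const std::vector<std::weak_ptr<Update>>& updates) {
  std::size_t alive = 0;
  for (const auto& weak : updates) {
    if (weak.lock()) ++alive;  // lock() yields nullptr once the owner is gone
  }
  return alive;
}

// Registers two updates, but the caller keeps only one of them alive.
inline std::size_t DemoWeakRegistry() {
  std::vector<std::weak_ptr<Update>> registry;
  auto kept = std::make_shared<Update>(Update{1});
  registry.push_back(kept);
  {
    auto dropped = std::make_shared<Update>(Update{2});
    registry.push_back(dropped);
  }  // `dropped` is destroyed here, so its registry entry expires.
  return CountReplayable(registry);
}
```

In this sketch `DemoWeakRegistry()` reports a single replayable update, since the second one expired with its only owner. Holding `std::shared_ptr` in the registry would keep both alive, at the cost of the owner extending their lifetime.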

src/iceberg/util/retry_util.h

wgtmac · 2026-03-18T10:05:29Z

@linguoxuan Could you please rebase to resolve the conflict?

wgtmac

Thanks for the update and for fixing the CI! I've just done a quick review pass.

wgtmac · 2026-03-25T15:10:29Z

src/iceberg/table.cc

  ICEBERG_ASSIGN_OR_RAISE(auto refreshed_table, catalog_->LoadTable(identifier_));
  if (metadata_location_ != refreshed_table->metadata_file_location()) {
    metadata_ = std::move(refreshed_table->metadata_);
+    metadata_location_ = std::string(refreshed_table->metadata_file_location());


This looks like a bug that can be fixed separately and merged quickly

wgtmac · 2026-03-26T14:54:21Z

src/iceberg/transaction.cc

 }

+Result<std::shared_ptr<Table>> Transaction::CommitOnce() {
+  auto refresh_result = ctx_->table->Refresh();


It seems unnecessary to issue a Refresh() call on the first attempt, since it may prevent a fast commit. Can we change this to call Refresh() only during a retry?

wgtmac · 2026-03-26T14:55:00Z

src/iceberg/transaction.cc


+Result<std::shared_ptr<Table>> Transaction::CommitOnce() {
+  auto refresh_result = ctx_->table->Refresh();
+  if (!refresh_result.has_value()) {


Let's reuse macros like ICEBERG_RETURN_UNEXPECTED and ICEBERG_ASSIGN_OR_RAISE to write fewer lines.

wgtmac · 2026-03-26T15:02:56Z

src/iceberg/transaction.cc

-
-    } break;
+  Result<std::shared_ptr<Table>> commit_result;
+  if (!CanRetry()) {


The non-retry path and CommitOnce() both build requirements and call UpdateTable, but through different code paths. Consider unifying: even the non-retry case could go through CommitOnce() with num_retries=0, which would eliminate the branching and reduce maintenance burden.
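A rough sketch of the unified shape suggested here, combined with the earlier request to refresh only on retries. `CommitHooks` and `CommitWithRetries` are hypothetical names, and the two callbacks merely stand in for Table::Refresh() and the catalog update call; error-kind filtering is left out to keep the control flow visible.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical hooks standing in for the refresh and catalog update calls.
struct CommitHooks {
  std::function<void()> refresh;
  std::function<bool()> try_commit;  // true on success, false on a conflict
};

// Single commit path: attempt 0 commits against the metadata already in hand,
// later attempts refresh first. num_retries == 0 gives the no-retry behavior
// without a separate branch.
inline bool CommitWithRetries(const CommitHooks& hooks, int32_t num_retries) {
  for (int32_t attempt = 0; attempt <= num_retries; ++attempt) {
    if (attempt > 0) hooks.refresh();
    if (hooks.try_commit()) return true;
  }
  return false;
}
```

Under these assumptions, a commit that succeeds on the third attempt triggers exactly two refreshes, and a non-retryable transaction performs one attempt with no refresh at all.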

wgtmac · 2026-03-26T15:12:03Z

src/iceberg/transaction.cc

+    ctx_->metadata_builder =
+        TableMetadataBuilder::BuildFrom(ctx_->table->metadata().get());
+    for (const auto& update : pending_updates_) {
+      auto commit_status = update->Commit();


Should we directly call this->Apply(*update) to avoid an indirection?

wgtmac · 2026-03-26T15:35:00Z

src/iceberg/update/snapshot_update.cc

 int64_t SnapshotUpdate::SnapshotId() {
-  if (!snapshot_id_.has_value()) {
+  while (!snapshot_id_.has_value() ||
+         base().SnapshotById(snapshot_id_.value()).has_value()) {


Doesn't SnapshotUtil::GenerateSnapshotId below have the same loop?
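To make the loop condition in the diff easier to reason about, here is a standalone sketch of it. `EnsureFreshId`, the `existing` set, and the `generate` callback are hypothetical; they stand in for the cached `snapshot_id_`, the snapshot lookup on the base metadata, and the ID generator. Whether the generator already performs an equivalent check is the open question above, and the sketch does not settle it.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <set>

// Regenerates the cached ID while it is unset or already present in `existing`.
// Note: this loops forever if `generate` never produces an unused ID.
inline int64_t EnsureFreshId(std::optional<int64_t>& cached,
                             const std::set<int64_t>& existing,
                             const std::function<int64_t()>& generate) {
  while (!cached.has_value() || existing.count(*cached) > 0) {
    cached = generate();
  }
  return *cached;
}
```

One case this condition covers is an ID cached during an earlier attempt that now collides with a snapshot in the refreshed metadata: the cached value is discarded and a new one is drawn.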

wgtmac · 2026-03-26T15:38:25Z

src/iceberg/util/retry_util.h

+
+#include <algorithm>
+#include <chrono>
+#include <functional>


wgtmac · 2026-03-26T15:40:59Z

src/iceberg/util/retry_util.h

+  }
+
+  /// \brief Specify error types that should stop retries immediately
+  RetryRunner& StopRetryOn(std::initializer_list<ErrorKind> error_kinds) {


It would be good to document the priority between OnlyRetryOn and StopRetryOn.
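One way such a priority rule could be written down is sketched below. The precedence shown (the stop list wins over the allow list) is an assumption made for illustration, not the behavior confirmed for this PR, and `RetryPolicy`, `ShouldRetry`, and the `ErrorKind` values are hypothetical.

```cpp
#include <initializer_list>
#include <set>

// Hypothetical error kinds for illustration.
enum class ErrorKind { kCommitFailed, kCommitStateUnknown, kValidationFailed };

class RetryPolicy {
 public:
  /// Retry only on the listed kinds (an empty list means "retry on anything").
  RetryPolicy& OnlyRetryOn(std::initializer_list<ErrorKind> kinds) {
    only_.insert(kinds);
    return *this;
  }

  /// Never retry on the listed kinds. Assumed to take precedence over
  /// OnlyRetryOn when a kind appears in both lists.
  RetryPolicy& StopRetryOn(std::initializer_list<ErrorKind> kinds) {
    stop_.insert(kinds);
    return *this;
  }

  bool ShouldRetry(ErrorKind kind) const {
    if (stop_.count(kind) > 0) return false;            // stop list wins
    if (!only_.empty()) return only_.count(kind) > 0;   // then the allow list
    return true;                                        // default: retry
  }

 private:
  std::set<ErrorKind> only_;
  std::set<ErrorKind> stop_;
};
```

Stating the rule in the doc comment, as above, lets callers predict the outcome when the same kind is passed to both setters.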

wgtmac · 2026-03-26T15:44:59Z

src/iceberg/update/pending_update.h

  virtual Kind kind() const = 0;

+  /// \brief Whether this update can be retried after a commit conflict.
+  virtual bool IsRetryable() const { return true; }


I think the Java impl has other types of updates that disallow retry, e.g. UpdateSchema. We might need to evaluate if we want to keep the same behavior.
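If some update kinds do end up opting out, the virtual hook from the diff would combine with a transaction-level check roughly as follows. The class names are hypothetical, and marking the schema update as non-retryable is only the possibility raised above, not a decision made in this PR.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical base class mirroring the IsRetryable() hook in the diff.
class PendingUpdateSketch {
 public:
  virtual ~PendingUpdateSketch() = default;
  virtual bool IsRetryable() const { return true; }
};

// Hypothetical update kind that opts out of retries.
class SchemaUpdateSketch : public PendingUpdateSketch {
 public:
  bool IsRetryable() const override { return false; }
};

// A transaction can be retried only if every pending update allows it.
inline bool CanRetryAll(
    const std::vector<std::shared_ptr<PendingUpdateSketch>>& updates) {
  return std::all_of(updates.begin(), updates.end(),
                     [](const auto& update) { return update->IsRetryable(); });
}
```

With this shape, a single non-retryable update disables retries for the whole transaction, and an empty transaction counts as retryable.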

wgtmac · 2026-03-26T15:50:46Z

src/iceberg/test/retry_util_test.cc

+// --------------------------------------------------------------------------
+// Test: Successful on first attempt — no retries
+// --------------------------------------------------------------------------
+TEST(RetryRunnerTest, SuccessOnFirstAttempt) {


We only have tests for RetryRunner. Can we add some integration tests that use Table or Transaction directly?

wgtmac · 2026-03-27T07:42:57Z

src/iceberg/util/retry_util.h

+  Result<T> Run(F&& task, int32_t* attempt_counter = nullptr) {
+    auto start_time = std::chrono::steady_clock::now();
+    int32_t attempt = 0;
+    int32_t max_attempts = config_.num_retries + 1;


Do we need to validate config_.num_retries?
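A small sketch of what such validation might guard against, given that the diff computes `max_attempts` as `config_.num_retries + 1`. `MaxAttempts` is a hypothetical helper, and rejecting invalid values (rather than clamping them) is an assumption.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Returns the total number of attempts (the first try plus retries), or
// nullopt when num_retries is unusable.
inline std::optional<int32_t> MaxAttempts(int32_t num_retries) {
  // A negative value would make the attempt count zero or negative.
  if (num_retries < 0) return std::nullopt;
  // Adding 1 to the maximum int32_t would overflow.
  if (num_retries == std::numeric_limits<int32_t>::max()) return std::nullopt;
  return num_retries + 1;
}
```

Without a check of this kind, a negative setting could result in the task never running, depending on how the loop bound is tested.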

wgtmac · 2026-03-27T07:45:30Z

src/iceberg/util/retry_util.h

+
+  /// \brief Run a task that returns a Result<T>
+  template <typename F, typename T = typename std::invoke_result_t<F>::value_type>
+  Result<T> Run(F&& task, int32_t* attempt_counter = nullptr) {


Instead of a raw pointer attempt_counter, would it be better to integrate with the metrics reporter that is being added by @evindj?

wgtmac · 2026-03-27T07:46:42Z

src/iceberg/util/retry_util.h

+  }
+
+  /// \brief Sleep for the specified duration
+  void Sleep(int32_t ms) const {


This function seems unnecessary

linguoxuan force-pushed the main branch 2 times, most recently from 82ada96 to ff6c292 Compare February 26, 2026 11:28

linguoxuan force-pushed the main branch 6 times, most recently from 2053566 to 05c1625 Compare March 2, 2026 12:11

wgtmac requested changes Mar 13, 2026

zhjwpku reviewed Mar 14, 2026

WZhuo reviewed Mar 17, 2026

linguoxuan force-pushed the main branch 3 times, most recently from 296d9ed to 79c0218 Compare March 22, 2026 09:27

linguoxuan marked this pull request as draft March 23, 2026 03:54

linguoxuan force-pushed the main branch from 79c0218 to 58df3a4 Compare March 25, 2026 08:44

linguoxuan marked this pull request as ready for review March 25, 2026 09:28

feat: retry failed transaction commit

38e3561

linguoxuan force-pushed the main branch from 58df3a4 to 38e3561 Compare March 26, 2026 02:19

wgtmac requested changes Mar 26, 2026

wgtmac reviewed Mar 27, 2026

4 participants
